There is growing interest in the use of semantic collections in order to identify and analyse
domain knowledge. This paper describes some technical issues to consider when contemplating
research which incorporates small-to-medium domain-specific word sets. The purpose of the
corpus construction described was to provide an external word collection which could be 
transformed to a numeric frequency scale which could take the place of an “expert” in order to 
evaluate the lexical content of aircraft Visual Landing Approach concept maps. Although this 
paper is based on research in the field of aviation education, the underlying principles are more 
widely applicable. 


General Corpora Definitions and Uses 

The study of naturally occurring word frequencies has been a focus of computational linguistics, 
and in particular of the field of corpus linguistics. A word collection known as a corpus is
constructed from some set of texts in order to determine what is characteristic of that text set
through the identification of vocabulary patterns that either differ from or conform to a norm (Ide &
Walker, 1993). In contrast to Chomskyan generative linguistics, which focuses on internal 
knowledge of language structures (Chomsky, 1957), empirical corpus linguistics seeks to describe 
language as it actually occurs, including the number and range of words which are appropriate in 
a defined context and which may not conform to the constraints of correct grammaticality and 
syntactic rules (Sampson, 1987; Kennedy, 1992). The underlying premise or assumption of an 
empirical approach is that semantic content can be meaningfully related to quantifiable word 
patterns in source texts drawn from constrained natural language sub-sets. 

Corpus linguistics uses this representative sample of spoken and/or written words in order to 
provide an authoritative body of linguistic evidence, one which can support generalizations and 
against which hypotheses can be tested (Trask, 1993; Coulthard, 1994). For example, it is 
reasonable to assume that novices and experienced individuals in a specific field may use either 
different domain vocabularies, that is, different words, or different relative distributions of the 
same words (Solomon, 1990). Word frequency distribution may be used in uni-textual analysis or 
for parallel text comparisons (Francis & Kucera, 1982; Sinclair, 1991; Caudery, 1992; Coulthard, 
1994; Davis, Dunning & Ogden, 1995; McEnery & Wilson, 1996). One drawback, however, is
that simple word counts and values derived from those counts are not generally sufficient for
word sense disambiguation (Toglia et al., 1978; Manning & Schutze, 1999).

Studies of word frequencies and frequency distributions are not restricted to corpus linguistics.
Thematic content analysis, which is based in the social sciences (de Sola Pool, 1959; Saporta &
Sebeok, 1959; Iker, 1974; Beardsworth, 1980), and literary analysis (Burrows, 1987; Ide, 1989;
Ide & Walker, 1993) are also concerned with quantifying and interpreting word occurrences, and 
both approaches have benefited in recent years from computerisation (Roller, Mathes & Eckert, 
1995). 

An example of this type of word-based thematic analysis is found in the integrative cognitive 
complexity scale described by Baker-Brown, Ballard, Bluck, de Vries, Suedfeld, and Tetlock 
(1992) in which the presence of text elements gives evidence of progressive differentiation and/or 
integration. This coding system focuses on structure rather than content, and its devisers suggest
it can be used with any connected verbal discourse. 

Corpus linguistics can complement these other language-based analysis methods (Biber, 1993). 
However, corpus linguistics has one specific advantage in that it may require relatively less
qualitative contextual knowledge for meaningful application (McEnery & Wilson, 1996). As a
minimum, corpus analysis requires only that any sample of words from a defined domain be
representative of that domain (Baayen, 1993).

Word Frequency Divisions 

A word frequency analysis typically involves raw word counts, ranks, and weights, and the 
comparison of these between different sources. A corpus which has been constructed from a 
representative selection of texts is more likely to demonstrate a range of word frequencies than 
one which has been constructed with bias (McEnery & Wilson, 1996). It follows that when a 
corpus has been derived from naturally occurring texts it may be partitioned into frequency
divisions which indicate functional differences within its overall word frequency spectrum. 

A high-frequency group typically includes several functional/structural words (e.g., to, of, in, at,
and, for, than) which are indicative of English language structure. High frequency words tend to 
have more diverse meanings than do lower frequency words, which implies a correlation between 
frequency and semantic complexity (Kilgarriff & Rosenzweig, 2000). A relatively high word 
frequency does not imply conceptual validity for an individual word or for the passages in which 
that word participates. For example, individually a high frequency word is just that: a word
which occurs frequently. The presence of two high frequency words in some sort of relationship 
says nothing about the correctness of that relationship, just that there may be a likelihood of those 
words co-occurring. 

The medium-frequency range of words denotes words of lesser generality but also of repeat 
frequencies (Herdan, 1964). Within this group, given a typical distribution not skewed by 
underlying functions, may be found a class of commonly used content words. 

Low-frequency words tend to bear greater informational value than words which occur more 
frequently (Herdan, 1964). The percentage of rare words as a representative feature of a text 
represents the richness, or diversity, of the text (Liiv, 1997). The size of the group of words 
which occur only once, denoted by the term “hapax legomena”, is a measure of vocabulary 
richness, and grows with an increase in vocabulary, which is an indication of word learning 
(Holmes, 1994). Sichel (1986, p. 53) noted that: 

the number and proportion of hapax legomena have been used to 
measure vocabulary richness in general. A person with a large 
proportion of hapax legomena is considered to command a richer 
vocabulary than one with a low proportion. 

A related rare word category is the hapax dislegomena, or the collection of text words which are
used twice. A practical difficulty with compiling word frequency counts from published,
“sanitized” texts, and from small corpora is that the sample range may not produce reliable counts
of these low frequency words (Burgess & Livesay, 1998).
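
The rare-word categories described above can be counted directly from a token list. The following is a minimal sketch only, assuming the text has already been split into word tokens; the function name `frequency_profile` and the sample sentence are illustrative and do not come from the study.

```python
from collections import Counter


def frequency_profile(tokens):
    """Summarise a token list: tokens, distinct types, and rare-word counts."""
    counts = Counter(tokens)
    # Frequency spectrum: maps an occurrence count m to the number of
    # distinct words that occur exactly m times.
    spectrum = Counter(counts.values())
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapax_legomena": spectrum[1],     # words used exactly once
        "hapax_dislegomena": spectrum[2],  # words used exactly twice
    }


# Illustrative sentence, not taken from the corpus.
sample = "the aircraft turns onto final and the pilot checks the final approach speed".split()
profile = frequency_profile(sample)
print(profile)
```

With so short a sample almost every word is a hapax legomenon, which illustrates why counts of rare words from small collections need to be read with caution.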

In summary, when a corpus has been derived from naturally occurring texts, functional 
differences may translate into differences in word frequencies. On the other hand, individual 
words with similar frequencies may have different functions. It may be noted when referring to 
word frequencies, whether as raw frequency counts, or as frequency related weights, that similar 
values do not necessarily imply semantic collocations, collocations being characteristic
co-occurring patterns of words. It is more likely that words of similar frequencies and similar
functions or meanings will not in fact occur together (Haskel, 1971; Berry-Rogge, 1974; 
Delcourt, 1992), and two words that have similar meanings, individually may occur with similar 
frequencies, but not together unless for emphasis. 

Corpus Size 

Many of the statistics regarding words and their frequency distributions derive from works on 
corpora that involve large numbers (>1,000,000) of words which are in general use (e.g., the 
Brown Corpus of contemporary American English and the British National Corpus). Large 
corpora are generally constructed from a range of diverse text categories, in order to make 
possible cross-category comparisons (Hofland & Johansson, 1982; Johansson, 1985). Smaller
purpose-built corpora deriving from more restricted domains are increasingly common with the
availability of software (Ide & Walker, 1993; Hickey, 1994). Baayen (1993) noted that a
relatively small corpus may be useful if it is used to study only that limited range of topics
actually contained in the corpus. Therefore, a purpose-built corpus need not approach the size of
the large general purpose corpora. However, the provision of a workable scale may necessitate the
incorporation of a much greater word range than the potential word range from the topic under 
study itself. 

Source Selection Issues 

The goal of document location and selection was to create a text collection that would be maximally
representative of available aviation texts. For a text sample to be of use, it must be typical, 
representative and unbiased, and include, where appropriate, samples of a broad range of authors 
and genres (Biber, 1988, 1993; Althiede, 1996). 

Selection of Corpus Inputs 

The particular corpus described in this paper was constructed from three sources, the desired
outcome being a representative range of Visual Landing Approach related words from both 
written and spoken sources. The largest source in terms of word number comprises 92 texts from 
aviation documents selected for their relevance to the Visual Landing Approach. Together these 
92 texts are identified as Input A. 

The documents which comprise Input A were products of particular historical and cultural
influences and were chosen to recreate a broad historical aviation scope, with the emphasis on the 
major themes and sources relating to Visual Landing Approach flight instruction. Each was 
specifically selected in order to mirror the scope of information which might be either provided to 
or be otherwise available to Australian flight instructors and flight students. Aviation museums 
and libraries in Australia, the United Kingdom, Canada, and the United States were contacted. 
These organisations provided roughly half of the documents evaluated for corpus inclusion.
Other major sources included active flight training organisations in those countries mentioned and
private individuals. 

One hundred and sixteen discrete texts were originally evaluated for content and relevance. Of 
these, twenty-four were eliminated because they were deemed to be too technical in their outlook, 
or they would have overly biased the geographical or chronological text distribution. 
Additionally, both the number and diversity of the corpus Input A source texts were intended to
lessen the possibility that the overall corpus word range would be overly constrictive in
vocabulary, as both Inputs B and C, described below, were derived solely from contemporary 
Australian sources. 

Of the 92 texts finally selected for Input A, 27 came from the United Kingdom, 47 from the 
United States of America, 11 from Australia, 3 from Canada, 3 from New Zealand, and 1 from
Norway (in English). The texts were drawn from a 90 year range and included both military and 
civil aviation sources, and official and popular genres. These 92 texts did not necessarily comprise 
whole documents. When identifiable Visual Landing Approach themes were embedded in a 
larger document, the relevant sections were selected out of that document. 

These 92 texts may be described as “examples”, as opposed to “samples” chosen by a random 
selection, of Visual Landing Approach writing. As a proper sample suitable for full linguistic 
analysis, this “exampling” method of document selection would have been inappropriate. Neither 
have the document excerpts been chosen through spread sampling, which Yule (1944) suggested 
may be even more closely representative than a random sample of the same size, but criticized as 
being prone to word “clumping” if not done properly. 

Input B consists of the transcriptions of five oral interviews with Australian flight instructors. All 
these interviews were conducted by the researchers during the period 1998 through 2000. Input C 
is composed of the 77 responses provided by Australian General Aviation pilots to an open-ended 
survey questionnaire on Visual Landing Approach expertise conducted by the researchers in 
1999. 

Therefore the Visual Landing Approach corpus comprised three separate input types, two of 
which derive from written sources (82% of total words) and one from oral sources (18% of total
words). As a comparison, the 100,000,000 word British National Corpus draws approximately
10% of its words from spoken data. The general semantic differences between written and spoken genres are
highlighted in Biber (1988, 1993), Chafe (1986), Hayes (1988), and Halliday (1994). 

Corpus Word Weighting 

The principal purpose of the weighting procedure used in this study was to correct for the over- or
under-representation of words due to relative size differences in their input sources. A secondary
purpose was to minimize subjectivity effects arising from document selection and excerption.
There are at minimum five different ways of weighting text excerpts, ranging from no weighting
at all, through intermediate procedures including variations on set block-length averaging, to a
fully normalized procedure (J. Burrows, personal communication, March 2000). All sampling
and weighting procedures are in spirit ultimately based upon the works of Yule (1944) and Zipf 
(1949) although they themselves diverged on how to list different forms of the same word, Zipf 
choosing to include every variation as a separate entry while Yule preferred to group them under 
one heading. 

Constituent Length Adjustments within Inputs A, B and C

Prior to applying the word weighting procedure, the constituent units within the corpus were 
adjusted to compensate for large disparities in text length. The 92 texts in Input A had a total 
word count of 46,180. These texts were of varying lengths ranging from 108 to 1,128 words (M =
501.13, SD = 263.887). For the weighting procedure, each text was considered as a component.

Input B originally consisted of five interviews with a total word count of 10,469. One interview 
contained 8,328 words with the next largest containing 1,053 words. The remaining three 
interviews totaled 1,088 words among them. These three interviews differed from the other two in 
length and from the longest interview also in purpose. Unlike the longest interview, which was 
free-flowing, the three short interviews were essentially single-question probes, as was the second 
longest interview. Based on these considerations, the three short interviews were combined into a 
single unit. Input B therefore consists of three components. 

The 77 survey responses which make up Input C have a total word count of 2,178. The responses 
ranged in length from one to 103 words (M = 28.987, SD = 16.509). The relative brevity of the
responses when compared to the constituent elements of inputs A and B argued against 
maintaining each of the 77 as a separate component for the purpose of word weighting. Therefore 
Input C consists of one component. 

Following these adjustments, which left intact the overall input file word length, the components 
in inputs A, B and C from which the subsequent word weights were derived were as follows:


Table 1: Number of Corpus Input Words and Components

Input     Number of Words   Number of Components
A                  46,180                     92
B                  10,469                      3
C                   2,178                      1
Totals             58,827                     96


Internal Weighting Factors 

The inputs to this Visual Landing Approach semantic data base consist of text excerpts of widely 
differing lengths. Therefore within each of Input A, Input B, and Input C a simple word count 
would result in relatively greater contributions from the longer extracts. Similarly, because Input 
A, Input B, and Input C contain a differing total number of words, a simple word count would 
result in a greater relative contribution from the longer Input A than from the shorter inputs B and 
C. To compensate for different corpus component lengths, inputs were normalized by an internal 
and an external word weighting procedure. 

Within each of Input A and Input B the total number of words was divided by the number of 
components to obtain a mean word count. For each component this mean word count was divided 
by the number of words in the component to provide an Internal Weighting Index. This 
procedure was not applied to Input C due to the general brevity of the individual components of 
Input C. 

The mathematical expressions are as follows:

The mean number of words in Input A is $\bar{n}_A$, where $\bar{n}_A$ is given by:

\[ \bar{n}_A = \frac{1}{N_A} \sum_{i=1}^{N_A} n_{Ai} \tag{1} \]

where $N_A$ = number of components in Input A
      $n_{Ai}$ = number of words in component $i$ of Input A

The Internal Weighting Index for component $i$ of Input A is $t_{Ai}$, where $t_{Ai}$ is given by:

\[ t_{Ai} = \frac{\bar{n}_A}{n_{Ai}} \tag{2} \]

Similar equations apply to Inputs B and C, i.e.,

\[ \bar{n}_B = \frac{1}{N_B} \sum_{i=1}^{N_B} n_{Bi}, \qquad t_{Bi} = \frac{\bar{n}_B}{n_{Bi}} \tag{3} \]

and

\[ \bar{n}_C = \frac{1}{N_C} \sum_{i=1}^{N_C} n_{Ci}, \qquad t_{Ci} = \frac{\bar{n}_C}{n_{Ci}} \tag{4} \]


The derivation of the Internal Weighting Index is illustrated by its application to the interview 
Input B. After adjustments as discussed above, Input B consisted of three components, each with 
a different Internal Weighting Index: 


Table 2: Internal Weighting Index for Input B

Component   Number of Words   Internal Weighting Index
                              (= Mean / Number of Words)
1                     8,328                      0.419
2                     1,053                      3.314
3                     1,088                      3.207
Total                10,469
Mean                  3,490

That is, each individual word in Input B component 1 is multiplied by 0.419, each word in 
component 2 is multiplied by 3.314, and each word in component 3 is multiplied by 3.207. 
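
The arithmetic behind these Input B values can be checked with a short sketch. It assumes only the component word counts reported for Input B; the function name `internal_weights` is illustrative and is not the authors' own tooling (the study used a concordance program together with spreadsheet software).

```python
def internal_weights(component_sizes):
    """Internal Weighting Index: mean component size / size of each component."""
    mean_size = sum(component_sizes) / len(component_sizes)
    return [mean_size / size for size in component_sizes]


# Word counts of the three Input B components after consolidation.
input_b = [8328, 1053, 1088]
weights_b = internal_weights(input_b)
rounded_b = [round(w, 3) for w in weights_b]
print(rounded_b)

# Weighted sizes: every component contributes the same weighted mass,
# and the total weighted mass equals the original word total.
weighted_sizes = [w * n for w, n in zip(weights_b, input_b)]
print(sum(weighted_sizes))
```

One consequence worth noting is that the weighting leaves the total for the input unchanged while giving each component an equal share of it, which is the stated aim of compensating for disparities in text length.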

A like procedure was applied to the document Input A to obtain Internal Word Weighting Indices
for each of the 92 components. As a length-based Internal Weighting Index was not calculated 
for Input C, the Input C Internal Weighting Index was 1.0. That is, every word in Input C 
counted as one Internal Weight unit. 

External Weighting Factors 

A similar procedure was used to produce an External Weighting Index for each of Input A,
Input B, and Input C as a whole. In this case the total number of words in the three inputs was divided by
three to obtain a mean word count. For each input A, B, and C this mean word count was divided 
by the number of words in the input to provide an External Weighting Index. 

The mean number of words in the three inputs, A, B and C, is $\bar{n}_{ABC}$, where $\bar{n}_{ABC}$ is given by:

\[ \bar{n}_{ABC} = \frac{1}{3} \left[ \sum_{i=1}^{N_A} n_{Ai} + \sum_{i=1}^{N_B} n_{Bi} + \sum_{i=1}^{N_C} n_{Ci} \right] \tag{5} \]

The external weighting factor for Input A is $T_A$ and is given by:

\[ T_A = \frac{\bar{n}_{ABC}}{\sum_{i=1}^{N_A} n_{Ai}} \tag{6} \]

Similarly the external weighting factors for inputs B and C are given by:

\[ T_B = \frac{\bar{n}_{ABC}}{\sum_{i=1}^{N_B} n_{Bi}} \tag{7} \]

and

\[ T_C = \frac{\bar{n}_{ABC}}{\sum_{i=1}^{N_C} n_{Ci}} \tag{8} \]



The External Weighting Indices for the three inputs A, B and C are illustrated in Table 3: 

Table 3: External Weighting Indices for All Inputs

Input     Number of Words   External Weighting Index
                            (= Mean / Number of Words)
A                  46,180                     0.4246
B                  10,469                     1.873
C                   2,178                     9.003
Total              58,827
Mean               19,609

The overall weighting factor for any word in a document is given by the product of the internal
and external weighting factors, e.g., the overall weighting factor for any word in Document i in
Input A is given by $t_{Ai} T_A$.
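
As a worked check on the external indices and on the product rule for the overall weight, the following sketch uses only the input totals and the Input B component sizes reported in this paper. The names `external_weights` and `overall_b1` are illustrative assumptions, not part of the original procedure.

```python
def external_weights(input_totals):
    """External Weighting Index: mean input size / size of each input."""
    mean_size = sum(input_totals.values()) / len(input_totals)
    return {name: mean_size / total for name, total in input_totals.items()}


input_totals = {"A": 46180, "B": 10469, "C": 2178}
external = external_weights(input_totals)
print({name: round(value, 4) for name, value in external.items()})

# Overall weight of each word in Input B, component 1 (8,328 words):
# the internal index for that component times the external index for Input B.
internal_b1 = (10469 / 3) / 8328
overall_b1 = internal_b1 * external["B"]
print(round(overall_b1, 4))
```

Under these figures a single word occurrence in the long interview counts for roughly 0.78 weighted units, whereas a word in Input C, whose internal index is 1.0, counts for about 9.0.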


Lexical Issues in Relation to Word Distributions 

Lexical analysis requires a definition of a “word”, and of the forms that a word may take. A 
graphic word may be defined as a sequence of alphanumeric characters surrounded by spaces, and 
may contain punctuation marks (Hofland & Johansson, 1982). A lexical word is one or more 
grammatical words which form a lexical unit. That is, a lexical word fills a single grammatical 
position and has a generally consistent meaning (Francis & Kucera, 1982). A lexical word may 
also be referred to as a lexeme (Trask, 1993). For example, “cat” and “cats” are both particular 
forms of the lexeme “cat”. 

A lexemic word and a graphic word are not necessarily commensurate, and information based on 
word counts may differ slightly between analysis stages and across software. This is because the 
definition of a “word” differs slightly across software, due to the character strings that a given 
software program will recognise as a meaning-conveying string. This can give rise to a class of 
ambiguous (non-interpretable) character strings, as distinguished from the set of unambiguous 
decipherable character strings. Landini (2000) argues for the desirability of deleting ambiguous 
character strings prior to statistical analysis. In the current analysis, a limited group of ambiguous
character strings was included in word counts because of its minimal effect on overall
distributions.
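
The point that word counts depend on each program's definition of a "word" can be seen with two simple tokenising rules. This is a sketch under stated assumptions: the sample sentence is invented, and neither rule is claimed to be the one used by the concordance program in this study.

```python
import re

# Invented sentence for illustration only.
text = "Turn onto final at 500 ft; don't over-correct."

# Rule 1: graphic words, i.e. character runs delimited by white space
# (punctuation stays attached to the word).
whitespace_tokens = text.split()

# Rule 2: maximal runs of letters and digits
# (apostrophes and hyphens split a word into parts).
alphanumeric_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(len(whitespace_tokens), whitespace_tokens)
print(len(alphanumeric_tokens), alphanumeric_tokens)
```

The second rule also produces the stray string "t" from "don't", an instance of the kind of non-interpretable character string discussed above.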

This analysis necessitated additional tests of file characteristics derived from linguistic and 
literary fields in order to demonstrate the properties that held within and across the corpus 
documentary inputs. The research tests employed in the characterisation of textual features were 
two standard Zipf analyses, the Yule’s K word characteristic, and the text/token ratio. A
computer-based concordance program was used in conjunction with Microsoft® Excel and Microsoft® Word
in the combination of lexemes and computation of lexeme (word-stem) weights.

Zipf Distribution 

Zipf (1949) described two general properties of natural languages, since accorded the status of 
“laws”, these being the rank-frequency law and the number-frequency law. The rank-frequency 
law says the plot of log (frequency) (y-axis) versus log (rank) (x-axis) approximates a straight line 
of slope -1. Figure 1 shows such a plot for all words in the Visual Landing Approach corpus. 
The line shown has a slope of -1.

ISSN 1446-5442 Web site: http://www.newcastle.edu.au/journal/ajedp/ 



Construction of a Purpose-Built Domain-Specific Word Corpus - Thomas et al 





Figure 1. The plot of the Visual Landing Approach corpus according to Zipf’s rank-frequency
law.

The number-frequency law says that, n being a word’s frequency, the plot of log (n) (y axis) 
versus log (number of words with frequency n) (x axis) approximates a straight line of slope -0.5. 
Figure 2 shows such a plot for all words in the Visual Landing Approach corpus. The line shown
is of slope -0.5. 



Figure 2. The plot of the Visual Landing Approach corpus according to Zipf’s number-frequency
law.

Landini (2000) noted that the rank-frequency law tends to be most clearly observed with high
frequency words, and the number-frequency law with low frequency words (see also Turner, 
1997). These properties hold for a variety of alphabetic, syllabic, and logographic natural 
languages. They also appear to hold for both written and spoken discourses (Balasubrahmanyan 
& Naranan, 1996), with constructed language types (Chen, 1991; Li, 1992), and with small word 
sets (Ridley & Gonzales, 1994). Analysis of semantic distances from the ideal slope may be used to
differentiate between different texts, and may also be used to indicate when a text has been 
selected or edited with bias. 

Figure 1 indicates that, apart from the three most frequent words, the corpus words do generally
follow a slope of -1, only deviating for words of frequency less than about ten. Furthermore,
Figure 2 indicates that the low frequency corpus words tend to follow a slope of -0.5. The corpus
word distribution therefore follows the Zipf “laws”, with the qualification noted by Landini
(2000).
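
Both Zipf plots can be summarised numerically by estimating a slope on log-log axes. The sketch below is one possible approach, not the procedure used in this study, where the figures show reference lines of slope -1 and -0.5 rather than fitted lines. It assumes an ordinary least-squares fit over all points, and all function names are illustrative.

```python
import math
from collections import Counter


def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


def rank_frequency_slope(tokens):
    """x = rank (1 = most frequent word), y = frequency of that word."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return loglog_slope(range(1, len(freqs) + 1), freqs)


def number_frequency_slope(tokens):
    """x = number of words with frequency n, y = n."""
    spectrum = Counter(Counter(tokens).values())  # n -> words occurring n times
    ns = sorted(spectrum)
    return loglog_slope([spectrum[n] for n in ns], ns)


# Synthetic token list whose word frequencies are exactly proportional to 1/rank.
ideal = [f"w{r}" for r in range(1, 9) for _ in range(840 // r)]
print(rank_frequency_slope(ideal))
```

Because the rank-frequency law is clearest for high frequency words and the number-frequency law for low frequency words, a fit restricted to the relevant frequency range may be more informative than a fit over all points.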


Yule’s K 

Yule’s K index of uniformity is a derived characteristic of word distributions which reflects the 
concentration of high frequency words (Yule, 1944) and is a standard index available in the 
Simple Concordance Program 4.0.4. As a measure of vocabulary richness it is based on the
assumption that the occurrence of any given word is based on chance and may be regarded as a 
Poisson distribution. A comparatively large K implies that the author’s vocabulary is highly 
concentrated on repeated words, whereas a comparatively small K indicates that the author’s 
vocabulary is less concentrated. Holmes (1994) noted that Yule’s K is constant with respect to 
the size of the sample text, which is a favorable feature in text comparisons. 
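
For readers without access to a concordance program, Yule's K can be computed from a token list using the commonly cited form K = 10^4 (S2 - N) / N^2, where N is the number of tokens and S2 is the sum of m^2 V_m over the frequency spectrum (V_m being the number of words that occur exactly m times). The sketch below assumes this form; the exact computation performed by the concordance program is not specified in this paper.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's K = 10^4 * (S2 - N) / N^2, with S2 = sum over m of m^2 * V_m."""
    n_tokens = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # m -> V_m
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n_tokens) / (n_tokens ** 2)


# Two extreme cases on invented data:
print(yules_k(["approach", "flare", "touchdown", "rollout"]))  # no repetition
print(yules_k(["flare"] * 4))                                  # maximal repetition
```

A text with no repeated words gives K = 0, and K rises as more of the text is concentrated on a few repeated words, which matches the interpretation of large and small K given above.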



Figure 3. Chart of Yule’s K characteristic for Visual Landing Approach corpus Input A and Input 
A corpus sub-divisions by decade and country of origin.

In Figure 3, the sub-divisions represent different methods of apportioning the texts which made 
up Input A, when the age and provenance of those texts were considered. Yule’s K for the total 
word set in Input A was 138.92. For comparison, the Yule’s K value for Input B was 89.19, and 
for Input C was 101.77. Therefore the words in the document Input A were more concentrated on 
repeated words than were the words from the survey responses (Input C), which in turn were 
more concentrated on repeated words than the words from the flight instructor interviews in Input
B. These results were to be expected due to the degree of “focus” of each genre. The relative
magnitude of these differences is not as great, however, as the differences in Yule’s K values for
the overlapping sub-sets in Figure 3. The Yule’s K value for the corpus overall, that is Input A,
Input B, and Input C, was 123.92. That is, when all three inputs were considered together, the 
high frequency concentration within the largest input (Input A) was diluted. This was one of the 
desired outcomes from the combination of different text genres into the Visual Landing Approach 
corpus. 

The Text/Token Ratio 

The concordance program provided the text/token ratios for the word sub-sets in this study. The 
text/token ratio is an indication of vocabulary diversity, for it compares the number of unique 
word types to the overall size of the document(s) from which they are drawn. In Holmes’ (1994) 
review of quantitative stylistic metrics, he noted that the text/token ratio is sensitive to variations 
in document length, and is only useful for document comparison with texts of similar length. The 
effect of text size on the text/token ratio is evident in the values in Figure 4. 


Figure 4. Word counts (columns, left scale) and associated text/token ratios (right scale) for
Visual Landing Approach corpus and corpus sub-divisions (Input A, Input B, and Input C).

Although Inputs A, B, and C together comprise the totality of the Visual Landing Approach 
corpus, individually the input text/token ratios vary among themselves and also when compared 
against the overall corpus value. These ratio differences are noted as confirmation of Holmes 
(1994) but they do not in themselves invalidate the use of the corpus text/token value for this 
research, which is as a general indication of the available number of unique corpus text words 
against which the concept map words could be matched, and not for document comparison per 
se. According to the concordance program the corpus contained 58,827 token words with a
text/token ratio calculated at 0.0788, indicating a total of 4,635 distinct orthographic
words available for concept map word comparison.
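
The relationship between the reported ratio and the number of distinct words is simple arithmetic, sketched below. The measure this paper calls the text/token ratio is more commonly written as the type/token ratio; the function name and the short example are illustrative, and only the figures 58,827, 0.0788, and 4,635 come from the paper.

```python
def text_token_ratio(tokens):
    """Number of distinct word forms divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)


# Reported corpus figures.
corpus_tokens = 58827
reported_ratio = 0.0788

# The ratio is reported to four decimal places, so the product only
# approximates the distinct-word count (about 4,635.6 here).
implied_types = corpus_tokens * reported_ratio
print(implied_types)

# Going the other way, 4,635 distinct words reproduce the reported ratio.
print(round(4635 / corpus_tokens, 4))
```

Because the number of distinct words grows more slowly than the number of tokens, the ratio falls as a text gets longer, which is why it is only comparable across texts of similar length.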

Lemmatisation of Corpus Words 

“Lemmatisation” is the term given to the grouping of grammatical words with the same stem or 
the same meaning and which belong to the same major word class, and which differ only in 
inflection or spelling (Francis & Kucera, 1982). Grouping related words into their respective
word stems, or lemmas, results in a more representative distribution of word types within a 
defined text. For example, grouping the forms “fly”, “flew”, “flies”, and “flying” into a lemma of 
“fly”, and summing associated values for those forms into that lemma provides a more realistic 
indication of the importance of the concept of “fly” within a word set than allowing the separate 
forms to stand on their own. Therefore lemmatisation of the corpus words was necessary in order 
to reflect the relative importance of those lemmas within the overall Visual Landing Approach 
vocabulary. 

Abbreviations and their related forms were combined when the abbreviation could be interpreted 
unequivocally. Mathematical formulas and symbolic relationships generally were not 
decomposed into their constituent elements. Spelling variants were considered to have semantic 
equivalence. Despite the increasing homogeneity of British, American, and Australian English 
(see Ramson, 1966; Collins & Blair, 1989; Cornelius, 1989; Taylor, 1989), numerous words 
demonstrated both a British and an American orthography (e.g., centre, center; aeroplane, 
airplane). All recognizable variants, including mis-spellings, were grouped with their appropriate 
word stems. 

Various pseudo-words, constructed by the use of a hyphen or slash to form a compound from two 
otherwise unlinked constituents, such as “five-year”, appeared in the documents. In order to 
eliminate these pseudo-words, when the constituent elements were themselves meaningful words 
the hyphen or slash was disregarded, allowing each de-hyphenated element to be treated as a 
token in its own right. 
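
The de-hyphenation rule can be sketched as follows. This is an assumption-laden illustration: the test for whether a constituent is "meaningful" is reduced to membership in a hypothetical word list (`known_words`), whereas in the study such judgements were made by the researchers.

```python
import re


def split_compound(token, known_words):
    """Split a hyphen or slash compound if every part is a known word.

    `known_words` is a hypothetical stand-in for the judgement that each
    constituent is itself a meaningful word.
    """
    parts = [part for part in re.split(r"[-/]", token) if part]
    if len(parts) > 1 and all(part.lower() in known_words for part in parts):
        return parts
    return [token]


known_words = {"five", "year", "pilot"}
print(split_compound("five-year", known_words))  # both parts are known words
print(split_compound("co-pilot", known_words))   # "co" is not, so left intact
```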

Individual examples were checked against their entries in Webster’s Dictionary of the English 
Language (McKechnie, 1987), and if there were substantial differences in meaning in derivative 
forms, those forms were not combined. Noun plurals and verb tenses were added to their root 
forms to produce one word. Although most adjectival and adverbial forms were combined and
their weights added, this was not obligatory. Nominalised verb forms were generally not coupled 
with their verb stems. 

A marked form is a form or construction which differs from another with which it stands in a 
paradigmatic relationship. For example, the lexical items “hostess” and “inconsistent” are marked 
with respect to “host” and “consistent” (Trask, 1993). Generally, marked forms were not
lemmatised, although individual examples were considered for lemmatisation on a case-by-case 
basis. 

Where only a variant or variants of a noun or a verb were present in the corpus but not the root 
form, one of the variants was selected to bear the weight. Assumptions of equal or similar 
meanings were avoided. Neither were rare usages grouped into superordinate categories in order 
to achieve a match. For example, the mention of a particular aircraft model, such as the C152,
was not combined into a more general aircraft type, e.g., Cessna. 

The resultant word set may be described as a non-tagged corpus rather than a tagged or a labeled 
corpus. That is, no grammatical tags or morpho-syntactic descriptions were attached to the word
tokens, such as those used by various corpus encoding Text Encoding Initiative conformant
software. Tagging would have preserved the grammatical structure of the de-contextualized
corpus content by identifying the specific context source of each word token, thus making
syntactic comparisons possible between different inputs. However, the concept map content
which drove the development of this corpus was not necessarily expressed in a developed
syntactic form, making cross-document syntactic analysis extraneous for this research. 

Whenever words were combined into word stems, the occurrences (counts) of those words were
added, and their associated weights were also added arithmetically. This involved the addition of 
those corpus word weights which had resulted from the Internal Weighting Index and External 
Weighting Index procedure. An overall summed weight was produced for each word stem based 
on the total number of word tokens attaching to that particular word-stem. Grouping on word 
stems and related forms did not affect the overall total weights. 
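
The summing of counts and weights under word stems can be sketched as below. The lemma table and all counts and weights are invented for illustration; in the study the grouping decisions were made case by case against a dictionary, not by a fixed lookup table.

```python
from collections import defaultdict


def merge_into_lemmas(word_stats, lemma_of):
    """Sum (count, weight) pairs of word forms under their lemma.

    Forms missing from `lemma_of` are kept under their own spelling.
    """
    merged = defaultdict(lambda: [0, 0.0])
    for word, (count, weight) in word_stats.items():
        lemma = lemma_of.get(word, word)
        merged[lemma][0] += count
        merged[lemma][1] += weight
    return {lemma: (count, weight) for lemma, (count, weight) in merged.items()}


# Invented figures: word form -> (occurrences, summed overall weight).
word_stats = {
    "fly": (10, 8.5),
    "flew": (3, 2.25),
    "flies": (2, 1.5),
    "flying": (7, 6.0),
    "runway": (12, 9.75),
}
lemma_of = {"flew": "fly", "flies": "fly", "flying": "fly"}

lemma_stats = merge_into_lemmas(word_stats, lemma_of)
print(lemma_stats)
```

Because grouping only moves weight from word forms to their stems, the totals over all entries are the same before and after merging.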

Domain Specific Terms 

With the exception of a small set of words developed specifically to describe aviation phenomena, 
or borrowed from non-English languages and the use of which is almost exclusively restricted to
aviation (e.g., ailerons), it was assumed that there are few unambiguously aviation words, and that 
aviation terms are identified through their appearance in an aviation context. Even in an aviation 
context, a word may be used as a non-aviation homograph. For these reasons, there was no a 
priori categorization of aviation and non-aviation specific words. 

Summary 

In any domain, some content will be more central or salient than other content. Through the provision
of a weighted word frequency scale, the use of more domain central words can be highlighted. It 
is also reasonable to assume that novices and experienced individuals in a specific field may use 
either different domain vocabularies, that is, different words, or different relative distributions of 
the same words. Systematic and significant differences in word uses can be identified through a 
quantified scale drawn from an appropriately constructed word collection. 

The corpus described in this paper was constructed on the highly restricted theme of the Visual 
Landing Approach. By confining the topic so narrowly, and incorporating a range of constituent 
texts, a relatively small but useful word set was created. Clearly a corpus of this size is unable to
fulfill the same functions as do large word sets such as the British National Corpus. However,
when a restricted research topic such as the Visual Landing Approach can be similarly isolated in
other fields, and texts are available for incorporation, the construction of a practical purpose-built
corpus arguably is within the technical ability of most researchers.